Abstract

Cirrus is a tool for protocol analysis. Given an encoded protocol of a subject solving
problems, it constructs a model that will produce the same protocol as the subject when it
is applied to the same problems. In order to parameterize Cirrus for a task domain, the
user must supply it with a problem space: a vocabulary of attributes and values for
describing states, a set of primitive operators, and a set of macro-operators. Cirrus' model
of the subject is a hierarchical plan that is designed to be executed by an agenda-based
plan follower. In this paper, the philosophical and mathematical foundations of Cirrus are
explored.
1. Introduction


1.1.  Function  and  Structure:  Two  Alternatives  in  Describing  Cognition 


In this section we outline a distinction between Functional and Structural viewpoints in
psychological theory. We postulate that the source of the distinction is in the source of
constraints that inform theory building. Subsequently, we propose that functional level theories
should hypothesize only functional relationships and we present a computational model
describing this process.


Dreyfus and Dreyfus (1987) trace the history of two alternative approaches in the
development of the new sciences of cognition. One approach, which we label the functional
approach, saw that there was a common level of description between the symbol-manipulating
capacities of the digital computer and the apparent symbol-manipulating
capacities of the human information processor. This common level was described by the
Physical Symbol System hypothesis (Newell, 1980, 1982), which, in short, stated that the
necessary and sufficient ingredient for intelligence was a system capable of manipulating
symbols, regardless of whether this system was implemented in silicon or organic substrates.
The analogy drawn by Newell is on a formal level; that is, the bridge between brains and
computers lies in their representational ability.


The second approach, which we label the structural approach, drew its inspiration not from
the capacities of the digital computer, but from the emerging neurological sciences.
Rosenblatt (1962) considered that the study of cognition should begin with the processes and
organization of the physical system underlying intelligence:

It is both easier and more profitable to axiomatize the physical system and then
investigate this system analytically to determine its behaviour than to axiomatize the
behaviour and then design a physical system by techniques of logical synthesis
(Rosenblatt, 1962; in Dreyfus and Dreyfus, 1987).

This approach is currently represented by
the burgeoning school called connectionism, which utilizes massively parallel connections of
simple computational neuron-like units to model cognitive phenomena (McClelland and
Rumelhart, 1986).

These  two  approaches  have  different  points  of  view  as  to  what  constitutes  an  adequate 
basis  for  describing  cognition.  Currently  there  is  much  debate  over  the  veracity,  purpose  and 
usefulness  of  one  over  the  other  of  these  viewpoints.  The  difference  has  been  described  as 
a  symbolic-subsymbolic  distinction  (Smolensky,  1984,  1986)  although  both  classes  of  models 
are  clearly  symbolic  in  that  both  seek  to  represent  (symbolize)  certain  aspects  of  cognition. 
Others have argued that the difference is one of levels of description, an implementational
level as contrasted to an algorithmic level of description (Anderson, 1987; Marr, 1982). We
prefer  to  avoid  this  terminology  since  it  implies  a  definite  causal  ordering;  i.e.,  that  the 
implementational  level  merely  exists  to  implement  pre-existing  algorithms.  To  avoid  this 
implication,  we  have  termed  the  two  viewpoints  functional  and  structural. 

We wish to argue that the difference between the two classes of theory stems from the
different sources of constraint that inform them. The functional
approach argues that representation is the basis of cognition. Cognition should be
described in terms of the formal structures capable of manipulating the relationship between
such symbolic objects to achieve the logical requirements for computation. Smolensky
(1986) describes this as follows: we have theories with descriptive entities, such as formal
logic, that capture human information processing in some high level domains (such as
mathematics and circuit analysis). Symbolic theorists attempt to extend this high level of
description "down the abyss", to attempt to describe the vast middle ground of cognition
for which there is no formal domain theory, in terms of rules and effective procedures.
Thus the source of constraints for these types of theories is the actual task domain. This
appeals to a more functional level of analysis, where the constraints for a computational
theory are drawn from the functional requirements of the task (cf. Chomsky's notion of
competence).

The connectionists, on the other hand, stated that cognition was an emergent property of
the arrangement of the physical units of the brain. Their paradigm is that intermediate
cognition is of the same kind as low level perceptual processing and is well described by
reference to low-level constraints on the type of computation that can be performed by the
brain. These constraints are dictated by structural considerations, such as the parallel
nature of neuronal structure and the computational abilities of richly interconnected units with
only local information. Thus, connectionists attempt to climb "up the abyss" to the
mid-ground of cognition from a structural level of description whose constraints derive from
neural considerations.

Our current concern is not to contrast these two theoretical approaches, but to draw an
implication from the distinction. The functional approach describes the competencies
required by certain tasks, while implementational level theories give structural accounts of
how such functional competencies may be achieved in a neurally plausible way. We argue
that this distinction should be taken seriously in cognitive science, and should dictate the
kind of theoretical entities that are appropriate at each level. We further wish to argue that
the model of cognition presented here, which certainly falls within the functional approach,
demonstrates how this distinction might be taken seriously. It does so by presenting a
method of inducing the functional competency of a particular domain skill without invoking
hypothesized structure.

The argument is as follows. Some domains of human competence are rule-like: they
embody a theory or an effective procedure. Usually this rule structure is something derived
from the environment and serves to describe the structure of the task. For example, there
are rules for performing subtraction procedures. These rules are codified in the teaching
environment and the successful learner is one who succeeds in internalizing these rules.

Functional theories are capable of describing this level of rule-using competence. They
are descriptive, detailing, at the level of logic and computation, the abilities required for
executing a particular task. The devices used to describe such abilities are rules and
representations. For example, consider the production rule formulation from Anderson's
well-known ACT* theory (1983):

IF  the  goal  is  to  subtract  the  digits  in  a  column 
and  the  subtrahend  is  larger  than  the  minuend 
THEN  set  a  subgoal  to  borrow. 

This  rule  formalizes  a  particular  piece  of  competency  required  to  accomplish  that  task.  Note 
that  as  such,  it  is  descriptive  of  the  task  domain  rather  than  of  the  performer  of  the  task. 

A collection of such rules can describe the competency required to perform a particular
skill in a certain domain. To verify completeness and sufficiency, such rules can actually be
"run" in a computational formalism called a production system, which effectively provides the
control structures to determine the order of firing of these rules, and provides the storage
facilities to keep track of inputs and partial answers. We argue that the production system
itself is useful in determining the sufficiency of the postulated set of rules.
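
The idea of "running" a rule set to test its sufficiency can be sketched in a few lines. The sketch below is an illustration only, not the ACT* interpreter: the working-memory keys, the function names and the companion `difference_rule` are assumptions introduced here. Only `borrow_rule` transcribes the rule quoted above.

```python
# Illustrative only: rule names and working-memory keys are hypothetical.

def borrow_rule(wm):
    """IF the goal is to subtract the digits in a column and the subtrahend
    is larger than the minuend THEN set a subgoal to borrow."""
    if wm.get("goal") == "subtract-column" and wm["subtrahend"] > wm["minuend"]:
        return {"goal": "borrow"}
    return None


def difference_rule(wm):
    """Assumed companion rule: no borrow is needed, so write the difference."""
    if wm.get("goal") == "subtract-column" and wm["subtrahend"] <= wm["minuend"]:
        return {"answer": wm["minuend"] - wm["subtrahend"], "goal": "done"}
    return None


def run(rules, wm, max_cycles=10):
    """Recognize-act cycle: fire the first rule that matches, and repeat
    until no rule matches or the cycle limit is reached."""
    trace = []
    for _ in range(max_cycles):
        for rule in rules:
            changes = rule(wm)
            if changes is not None:
                wm = {**wm, **changes}   # apply the rule's action
                trace.append(rule.__name__)
                break
        else:
            break                        # no rule matched: halt
    return wm, trace


def is_sufficient(rules, problem):
    """A rule set is sufficient for a problem if running it reaches 'done'."""
    final, _ = run(rules, dict(problem))
    return final.get("goal") == "done"
```

With these two rules, a column such as 7 - 3 runs to completion, whereas 3 - 7 halts with an unsatisfied "borrow" subgoal. The run therefore exposes the rule set as insufficient, which is the kind of check the production system is said to provide.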

One problem with this approach to determining functional knowledge is that the rule set
must be deduced a priori. For domains which have a clear formal domain theory this is not
a huge problem (cf. the rule sets for geometry proofs in Anderson et al., 1981). In effect,
determining the knowledge that a problem-solver has in a domain then becomes a generate
and test cycle. We generate a set of rules and then test them in a production system to
determine the sufficiency of the postulated rule set. Note that no claims for necessity can
be made for such a derivation method.

An  alternative  to  this  approach  would  be  to  attempt  to  determine  this  functional  knowledge 
from  the  actual  performance  of  a  problem  solver.  Rather  than  extract  the  proposed 
competence  from  the  task  domain,  it  may  be  instructive  to  derive  this  knowledge  from  the 
student. We may be able to look through the student's eyes at the acquired domain
knowledge,  rather  than  just  postulating  necessary  knowledge. 

A  secondary  concern  with  the  functional  approach  is  that  the  mechanism  for  determining 
the completeness of the rule set (e.g., the production system interpreter) may start to look
attractive as the structural basis with which such competence is achieved. By eliminating
the generate and test cycle, and deriving the rule set that characterizes a task domain from
actual performance data, we eliminate the need for an executing mechanism to guarantee
sufficiency,  and  thus  avoid  the  need  for  statements  about  structural  hardware  employed  by 
the  problem  solver.  This  avoids  the  problem  of  making  structural  propositions  based  on  a 
functional  source  of  constraints.  We  return  to  this  problem  in  our  conclusions. 

The  model  of  cognition  presented  here  directly  confronts  the  need  to  extract  the  functional 
knowledge  of  a  problem  solver,  without  being  seduced  into  making  structural  assumptions. 
(We will later argue that structural hypotheses should be made within structurally constrained
theories.) This paper presents a theory of how the description of the mind should proceed.
We describe how a theory of cognition at the symbolic level should be descriptive of
the mental competence of the subject under exploration. In the next section, we detail what
we see as the principal assumptions required by models of cognition that recognize their

2. Fundamentals of Descriptive Cognition

At the "symbolic timescale" (when we consider cognitive actions in 0.5 to 10 second
chunks), some forms of cognitive behavior¹ can be meaningfully described as a series of
operators  that  move  the  cognizer  through  a  sequence  of  states,  leading  to  a  more  desired 
state,  namely  the  goal  of  the  sequence  of  actions.  This  formulation  was  proposed  initially  by 
Newell  and  Simon  (1972)  and  was  termed  the  Problem  Space  hypothesis.  They  characterized 
problem  solving  and  other  phenomena  as  search  within  the  problem  space,  which  consisted 
of  all  the  possible  knowledge  states  in  a  particular  domain,  and  which  was  traversed  by  the 
application  of  operators  to  transform  one  state  into  the  next.  The  path  so  traced  through  the 
hypothesized  problem  space  was  termed  a  solution  path.  The  problem  space  hypothesis 
suggests  that  we  may  understand  problem  solving  behavior  in  terms  of  operator  application 
sequences  leading  to  the  goal  state. 

How can we understand what gave rise to such a sequence of behaviours? We posit that
the task for a functional theory of cognition is to describe the states that lead to such a
sequence of behavior. Historically, this was also the task that the behaviorist tradition set for
itself. They proposed that to understand the responses made by an organism, one need only
know the stimulus conditions holding at the time of a response, since what the cognizer
really knew was an association between this stimulus configuration and the response. This
notion is embodied in Clark Hull's formulation of the famous rule Behavior = F(Stimulus
Intensity, Drive, Habit Strength, Incentive Reinforcement), stating this association.

In line with this behaviorist credo, we accept that there will be features of the stimulus
situation that will be causally related to the particular response of the organism. Further,
understanding such an association is a fruitful way to describe and understand the
competence of the subject. This leads to our first principle of functional cognitive theory.

¹ That is, behavior sequences for which there exists a formal domain theory.

Cognitive behavior can be described as a sequence of State ($S_i$) - Operator ($O_i$) pairs such
that there exists a function $F$ that maps from $S_i$ to $O_i$. We term this the Regularity principle.
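
To make the Regularity principle concrete, the following sketch checks whether a sequence of state-operator pairs admits any function from states to operators at all. It is a minimal illustration under our own assumptions: states are encoded as dictionaries of attribute-value pairs, and the function name `induce_mapping` and the attribute and operator names in the usage note are hypothetical.

```python
def induce_mapping(pairs):
    """Return a dict F with F[state] = operator for a list of
    (state, operator) pairs, or None when the same state is paired
    with two different operators, in which case no function F exists."""
    mapping = {}
    for state, operator in pairs:
        key = frozenset(state.items())      # hashable form of the state
        if mapping.setdefault(key, operator) != operator:
            return None                     # regularity is violated
    return mapping
```

For example, a protocol in which the state `{"bottom": 0}` is followed once by `ShowTop` and once by `Difference` yields `None`. Under the principle, this signals that the state description is missing an attribute that distinguishes the two occasions.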

However, since we are proposing a theory of cognition rather than behavior, we need to
consider more than just the external state at the time of the behavior. We must also
consider features of the internal state, attributes describing the state of the information
processor. Internal state attributes may describe features relating to goals of the organism,
partial results, past history and the state of the processor itself. Thus, the Regularity
principle needs to be extended to include internal state such that any response of the
organism is best described by the salient features of internal state and/or external state. This
leads to our next assumption about functional theory.

Our second principle concerns the nature of the response of the organism. Just as a
cognitive theory needs the capacity to represent internal state, it also needs the capacity to
represent internal operators that may change internal state. This is just another way of
saying that observed behavior is not necessarily the result of a unitary internal act, but that
multiple internal states and operators, in a sequence of state-operator changes, may have
preceded the externally observed behavior. Much has been written about the
decomposability of complex skills, generally under the rubric of hierarchical goal structured
behavior. The most convenient way of describing such decomposability is by way of a
grammar. We title our second principle the Decomposition principle.

To summarize the two principles that we have suggested form the basis of a descriptive
functional-level theory of cognition: we have suggested, firstly, that cognitive behavior at the
symbolic level may best be understood by knowing features of the internal and external
world that were significant at the time of the behavior (the Regularity principle). Secondly,
we acknowledge the need to describe internal operations. Note that we do not feel
constrained to say how these operations may be implemented. Rather, they exist because of
task demands and external theories that the learner has internalized. They correlate with
theories of task decomposition. These internal task decompositions may best be described
as a grammar, and so we label our second principle the Decomposition principle. These
two principles in themselves do not constitute a functional theory of cognition. In fact, we
label them principles since they are not falsifiable. However, they do establish the
framework for a descriptive functional theory of cognition, which we consider in the next
section.

2.1.  A  Functional  Theory:  Fundamental  Hypotheses 

We have just considered how a functional theory of cognition should determine the
relevant parameters of behavior, which are
$\text{Operator} = F(\text{State}_{\text{internal}}, \text{State}_{\text{external}})$. In this
section we consider some of the assumptions necessary to mechanistically determine this
relationship. We do this by detailing the assumptional basis for a computational
implementation of this theory called Cirrus.

External state is given as data, but how are we to determine internal state? Internal state
is determined by a theory of domain structure. That is, the specification of internal state is
a theory given by the experimenter as to how a certain competency may be achieved. In
the procedural skills that we will discuss, the internal state theory describes how the top-level
task is decomposed into smaller task steps. It is reasonable to assume that complex
tasks, particularly those taught in formal education, have an associated method for "divide
and conquer" specifications: how difficult problems can be reduced to simpler problems for
which the student has already acquired competence. This task decomposition can be
described as a grammar hypothesis, which describes the hierarchical structure of the task.
The grammar specifies how a given skill is decomposed hierarchically to give internal goal
states.

We can decompose the knowledge implied by the grammar into two components. One
component, the goal component, is the actual task decomposition, its goal structure. This
component is given by the rewrite rules of the grammar. However, grammars may be
non-deterministic. That is, one left hand side (LHS) clause may expand into multiple right hand
side (RHS) rules. The second component may be labelled the method knowledge: knowing which of a
set of possible operations to apply. We can extend the grammar to be an annotated grammar to
include this information. In the present model, we assume this information to be included in
the operators, which contain general base restrictions for when they are appropriate.

Adopting  this  approach  simplifies  the  models  constructed.  For  example,  a  grammar  rule 
might  specify  that  the  operator  SublCol  could  be  replaced  by  either  ShowTop  or  Difference. 
Which operator is produced by expanding this rule is dictated by the conditions of
applicability for each operator (i.e., that the bottom digit equals 0 for ShowTop).
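
A non-deterministic rewrite rule whose choice is settled by operator applicability conditions might be sketched as follows. This is a hedged illustration rather than Cirrus's actual representation: the dictionary layout and the state keys are assumptions, and only the condition for ShowTop (bottom digit equals 0) comes from the text. The condition given for Difference is invented so that the example is complete.

```python
# Goal component: rewrite rules of the grammar (LHS -> alternative RHS operators).
GRAMMAR = {
    "SublCol": ["ShowTop", "Difference"],
}

# Method knowledge, stored with the operators as applicability conditions.
CONDITIONS = {
    "ShowTop": lambda state: state["bottom"] == 0,      # from the text
    "Difference": lambda state: state["bottom"] != 0,   # assumed for illustration
}


def expand(goal, state):
    """Return the operators that `goal` may be rewritten to in `state`."""
    return [op for op in GRAMMAR.get(goal, []) if CONDITIONS[op](state)]
```

Calling `expand("SublCol", {"top": 5, "bottom": 0})` selects ShowTop, while a non-zero bottom digit selects Difference. The grammar itself stays unannotated, which is the simplification described above.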

In addition to a specification of the task decomposition rules, we need to specify an
ordering hypothesis. This hypothesis covers the way we theorize that the skill performance is
ordered, that is, what determines the order of subgoal expansion. As a model of how such
a goal tree may be ordered, we could appeal to a stack model of subgoal processing (cf.
Soar (Laird, Rosenbloom and Newell, 1984, 1985, 1986; Rosenbloom, Laird, McDermott,
Newell and Orciuch, 1985) with its universal subgoaling). That is, the order of expanding
subgoal nodes is given by a deterministic order.

However, we feel that the stack model is a special case of the more general agenda
model, in which the order of subgoal expansion is non-deterministic. Roughly, the agenda
ordering hypothesis states that the subgoal operators expanded from a goal are executed in
an order that is not specified by the goal tree. Note that the assumption of an agenda
mechanism does not imply that we hypothesize that there is an "agenda scheduling device"
hardwired into the brain. Rather, we feel this is a suitably general descriptive formalism to
describe how internal states may be ordered. That they need ordering is assumed from the
fact that skills are described by a hierarchical task structure (Decomposition principle).
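
The claim that the stack model is a special case of the agenda model can be illustrated with a small plan follower whose selection policy is a parameter. The sketch is ours and purely descriptive: `follow_plan`, the `choose` policies and the toy goal names are hypothetical, not part of Cirrus.

```python
def follow_plan(goal, grammar, choose):
    """Expand `goal` using an agenda. `choose(agenda)` returns the index of
    the agenda item to process next; items absent from `grammar` are
    treated as primitive operators and executed."""
    agenda = [goal]
    executed = []
    while agenda:
        item = agenda.pop(choose(agenda))
        subgoals = grammar.get(item)
        if subgoals is None:
            executed.append(item)        # primitive operator
        else:
            agenda[0:0] = subgoals       # add subgoals to the front, in order
    return executed


def stack_policy(agenda):
    """Always take the most recently added item: a stack discipline."""
    return 0


def last_policy(agenda):
    """One of many other orderings an agenda permits."""
    return len(agenda) - 1
```

With a toy grammar such as `{"Sub": ["ColA", "ColB"], "ColA": ["a1", "a2"], "ColB": ["b1", "b2"]}`, the stack policy yields the depth-first order `a1, a2, b1, b2`, while other policies execute the same primitives in a different order. The goal tree is identical in both cases and only the ordering hypothesis differs.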

Finally, we assume that states can be adequately described by attribute-value pair
representations. That is, for each state, we can specify the universe of attributes that
adequately represent the complete set of properties and relations, and their values. We
interpret an "attribute" rather loosely: any piece of information at all can be expressed,
limited only by what is considered relevant to the domain. For example, we could postulate
an attribute that a certain piece of information was in memory with a value true or false, or
that certain actions had just taken place, or that something existed or had existed in the
external environment. It is considered crucial to actually encode all such facts that may bear
on the task domain.

All these assumptions, the grammar hypothesis, ordering hypothesis and representation
assumption, come under the Decomposition principle. These allow us to specify a theory of
internal  operators,  that  we  can  combine  with  the  data-given  external  operators.  We  can  also 
specify  internal  state  given  this  theory.  Of  course,  the  task  theory  and  the  hypotheses  we 
use  to  operationalize  it  may  be  wrong,  and  later  we  discuss  how  theories  may  be  accepted 
or  rejected  in  a  proposal  for  hypothesis  testing. 

However, our descriptive task is not yet complete. Even though we now have a way to
specify which states go with which operators, this information is not yet useful in itself. A state
description contains the universe of attributes applicable to a domain which may be true or
false (or valued) at that given time. The Regularity principle leads us to infer that out of this
universe of attributes, some will be meaningfully (perhaps causally) related to the operator
application. Thus the second function of a descriptive theory of cognition must be to define
a method to extract the most informative attributes from the complete universe. This is
clearly an inductive task over the universe of attributes. The question is, which attributes are
most highly predictive of an operator application? Thus the basis of the model is a general
induction algorithm, in that the most predictive features will also be the most general
features associated with an operator application. The model employs an induction formalism
to determine which features of internal and external state are the most predictive. There is
a parallel between the notion of concept formation, with its selection of the most general
features of the instances of a concept, and the inductive task as it is performed by Cirrus.
This parallel is discussed subsequently in the section on Induction and Psychological
Modelling. The way in which the inductive generalization is accomplished is discussed below
under Attribute Selection.

3.  Induction  and  Psychological  Modelling:  a  brief  review 
Cirrus utilizes an induction formalism known as decision trees (Dtrees). This formalism has
a rich psychological background which is traced in this section. Dtrees had their first
psychological application in the work on concept formation. The paradigm for this line of
research was established by Bruner, Goodnow and Austin (1956). They studied how subjects
form hypotheses about a concept on the basis of positive and negative exemplars. In these
studies, concepts, and thereby concept attainment, were defined by critical attributes of the set
of objects to be classified. Subjects were presented cards containing objects that could
vary along four dimensions: number of objects, their shape, their colour and the number of
borders surrounding them. Subjects had to discover a concept (i.e., certain values on 1 or
more of the four attributes) that covered the positive instances presented but none of the
negative instances. To do so, they had to identify which attributes were relevant (attribute
identification) and the kind of rule which connects those attributes (conjunctive, disjunctive or
relational) (Anderson, 1980). Cirrus applies the notion of critical attributes of a concept
formation rule to the critical features of a rule of operator application.

The combination of attributes selected was used as a rule for the application of a
particular concept. As an extension of this, concepts can also be represented as a
sequence of tests of the values of individual attributes. Simon and Feigenbaum (1979)
employed such a formalism in their EPAM model of perceptual recognition. Tests on
attributes of the stimuli led naturally to a decision tree representation, where attributes form
a test at each node of a tree, whose branches are the values of that attribute. A concept
is thus sorted down successive branches of the tree, until it arrives at a leaf of the tree.
The name of the leaf is the label for the concept identified while the denotation of that
name is the set of concepts sorted to that leaf. Thus the concept is the decision rule
formed by tracing that path in the Dtree. Cirrus adopts an EPAM-like tree representation,
except that operator classes instead of concept classes are the labels of the leaves of the
tree.
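
Sorting a state down such a tree until it reaches an operator-labelled leaf can be sketched as follows. The tree encoding and the attribute names are assumptions made for this illustration, and the operator labels are borrowed from the subtraction examples earlier in the paper. The sketch is not Cirrus's data structure.

```python
# A tree is either a leaf (an operator name, as a string) or a pair
# (attribute, {value: subtree}). All attribute names are hypothetical.
TREE = ("bottom_is_zero", {
    True: "ShowTop",
    False: ("top_less_than_bottom", {
        True: "Borrow",
        False: "Difference",
    }),
})


def classify(tree, state):
    """Sort `state` down `tree`; return the leaf label and the path of
    (attribute, value) tests taken, which together form the decision rule."""
    path = []
    while not isinstance(tree, str):
        attribute, branches = tree
        value = state[attribute]
        path.append((attribute, value))
        tree = branches[value]
    return tree, path
```

The returned path is the decision rule for the leaf. For example, a state in which the bottom digit is non-zero and the top digit is smaller is sorted to `Borrow` after two attribute tests.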

The EPAM model was an incremental model of concept formation, unlike the model
formulated by Hunt, Marin and Stone (1966) in their Concept Learning System (CLS). CLS was
intended to solve single-concept learning tasks, the learned procedure then being capable of
classifying new instances. They presented the decision tree process with a complete set of
examples initially. For Hunt, Marin and Stone (1966), CLS was a device that discovered
rules for combining previously learned concepts (attributes) to form a new decision rule.
Similarly, Cirrus is presented with all the positive and negative instances at one time. Thus
Cirrus is not a model of the acquisition process for operator-state pairing knowledge; rather,
it represents the knowledge of the problem solver as defined by their performance at that
point in time. Quinlan (1983, 1986) extended the CLS model to deal with noise, with n-ary
attributes rather than binary, and with complex relational attributes rather than the feature-value
representation used in CLS. These improvements in the Dtree formalism were adopted
in Cirrus, except that the feature-value representation format was retained.
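
Non-incremental tree building over a complete set of state-operator instances with n-ary attributes can be sketched as below. The attribute selection measure here is information gain, as used in Quinlan's ID3. This is an assumption made for the sake of a runnable example, and nothing in this section states that Cirrus selects attributes this way. All function and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of operator labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_attribute(examples, attributes):
    """Pick the attribute with the highest information gain (ID3-style)."""
    base = entropy([op for _, op in examples])

    def gain(attr):
        remainder = 0.0
        for value in {state[attr] for state, _ in examples}:
            subset = [op for state, op in examples if state[attr] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)


def build_tree(examples, attributes):
    """examples: list of (state_dict, operator), all given at once."""
    operators = [op for _, op in examples]
    if len(set(operators)) == 1:
        return operators[0]                          # pure leaf
    if not attributes:
        return ("INCOMPLETE", sorted(set(operators)))  # mixed leaf, marked
    attr = best_attribute(examples, attributes)
    rest = [a for a in attributes if a != attr]
    branches = {}
    for value in {state[attr] for state, _ in examples}:
        subset = [(s, op) for s, op in examples if s[attr] == value]
        branches[value] = build_tree(subset, rest)
    return (attr, branches)
```

Because every example is available before the first split is chosen, the procedure describes the knowledge implied by a finished body of performance data rather than modelling how that knowledge was acquired.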

Langley, Ohlsson and Sage (1984a, 1984b) have applied the decision tree formalism to
modelling of student data in a similar manner to that presented here. However, there are
some important differences between the two models. Their model (ACM) takes as input the
answers to a problem domain rather than the actual solution paths, and ACM determines
the solution path taken to arrive at that answer. It chooses the minimum cost solution path,
which is an untenable assumption for students. For example, even in the simplified world of
subtraction they employ, actual students show marked variation in solution paths while
arriving at the correct answer. In more complex domains such as algebra, the variance in
solution paths is quite astounding, even given students with the same learning history.

ACM forms separate decision trees for each operator, in contrast to Cirrus, which forms
only one Dtree to classify all the operators for a particular skill. In ACM's case, attributes
are given positive or negative values according to whether they were present when the
particular operator was applied. States are then sorted down the tree according to whether
they were a positive instance of operator application or a negative instance. In Cirrus, the
decision tree is formed over the whole problem space rather than over an individual
operator. For Cirrus, adequate classification is assured since discrimination is incomplete
(and marked as such) unless each leaf of the Dtree contains only one class of operator. In
Langley et al.'s formalism, inadequate discrimination between operators is not addressed.
Finally, ACM employs a different mechanism for selecting the next attribute in building the
tree. This point will be discussed in more detail under Attribute Selection. However, the
major distinction between the two models is in the fact that Cirrus employs real protocol
data rather than hypothesized or ideal data. This motivated many of the features of Cirrus,
such as the noise-filter added to the attribute-selection mechanism. Overall, the inductive


Descriptions  of  Mind 




attributes could be recoded in a binary format, this is conceptually non-problematic, but
psychologically it is less plausible. The procedure used by CLS falls into the class of
category validity methods of attribute selection, after Rosch's (1975) notions of family
resemblance. Category validity methods seek to maximize the coverage of each critical
attribute over all the positive instances (Smith and Medin, 1981). That is, a feature is high
in category validity to the extent that instances of the category contain that
feature. Although much research supports the category validity model (Rosch and Mervis,
1975; Medin, Wattenmaker and Michalski, in press), the CLS procedure for attribute
selection is computationally costly, since it can involve up to 3 serial searches, which
seems uneconomic if the attribute space is large.

A more psychologically satisfying method was employed by Langley et al. (1984) in ACM.
Their system computes the number of positive instances matching a given test ($M_+$), the
number of negative instances failing that test ($U_-$), and the total numbers of positive ($T_+$)
and negative ($T_-$) instances. ACM calculates the sum of the proportions of instance class to
total class,

$$S = \frac{M_+}{T_+} + \frac{U_-}{T_-},$$

and then computes $E = \max(S, 2 - S)$. It is easy to see that an optimal
test would be one that completely discriminated all the positive and negative examples. Such
a test would receive a score of 1 + 1 = 2. Similarly, a test with no discrimination would
match only half the positive as well as half the negative examples, and so would score one.
The term $2 - S$ gives the discriminability of negated tests.
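The ACM evaluation function described above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the text; the function and parameter names are our own and are not taken from ACM.

```python
def acm_score(m_pos, u_neg, t_pos, t_neg):
    """Score a test as E = max(S, 2 - S), where S = M+/T+ + U-/T-.

    m_pos -- positive instances matching the test (M+)
    u_neg -- negative instances failing the test (U-)
    t_pos -- total positive instances (T+)
    t_neg -- total negative instances (T-)
    """
    s = m_pos / t_pos + u_neg / t_neg
    # 2 - S is the score of the negated test, so E rewards a test that
    # discriminates well in either direction.
    return max(s, 2 - s)
```

A perfectly discriminating test scores 2, a test that splits both classes in half scores 1, and a test that is perfectly wrong also scores 2 because its negation is perfect.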

This method works in the case of single attribute selections but fails when multiple attribute
selections are considered. To illustrate the shortcomings of the above evaluation function, we
shall treat the dtree as an information source; that is, the dtree can be seen as providing a
classification and hence information. Information theory tells us that an act is informative to
the extent that it reduces uncertainty, and the amount of information received is proportional
formalism that Cirrus uses to extract information from the universe of state features has a
long history in the psychological literature.

3.1.  Attribute  Selection 

The essential element of the regularity principle is to extract the most informative aspects
of the external and internal state that predict the application of an operator. This section
details the process of selecting these features from the universe that characterizes the state.

Dtrees work inductively to make classificatory decisions. In this case the tree contains
nodes for the critical attributes only, rather than specifying the universe of attributes U. Such
a partial description may be realized by several equivalent trees. Different attribute sets may
adequately classify the universe of objects, but some will be better than others. The
mechanism that determines the quality of the generalization is that which chooses the
attribute to discriminate at successive roots of the Dtree. Several methods have been
employed in the previously discussed models, and these are compared further in this section.

Hunt et al. (1966) propose criteria based on cost minimization (costs associated with
measurement, complexity of the tree, and understandability) for their CLS model. The optimal
attribute for the root of the tree was chosen by a look-ahead method. This was
accomplished as follows:

1. A search was made for an attribute value which appeared in all the positive
instance descriptions and never in the negative instance descriptions. The dtree-building
method halted if such an attribute was found, since it completely
differentiated the world of objects.

2. If this search failed, the above procedure was then applied to negative objects.

3. If both steps 1 and 2 failed, then the attribute which had the highest frequency
for a given value was selected.
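The three steps above can be sketched as follows. This is a minimal illustration, not the CLS implementation: instances are assumed to be dictionaries mapping attribute names to values, both instance lists are assumed to be non-empty, and step 3 is read as "the attribute-value pair that occurs most often across all instances", which is our interpretation of the description.

```python
from collections import Counter


def separating_pair(targets, others):
    """Return an (attribute, value) pair found in every target and in no other instance."""
    for attr, value in targets[0].items():
        in_all_targets = all(t.get(attr) == value for t in targets)
        in_any_other = any(o.get(attr) == value for o in others)
        if in_all_targets and not in_any_other:
            return attr, value
    return None


def cls_select(positives, negatives):
    """Choose a root test using the three-step look-ahead described in the text."""
    # Step 1: a value shared by all positive and no negative instances.
    hit = separating_pair(positives, negatives)
    if hit is not None:
        return hit + ("positive",)
    # Step 2: the same search applied to the negative instances.
    hit = separating_pair(negatives, positives)
    if hit is not None:
        return hit + ("negative",)
    # Step 3 (assumed reading): the most frequent attribute-value pair overall.
    counts = Counter((a, v) for inst in positives + negatives for a, v in inst.items())
    (attr, value), _ = counts.most_common(1)[0]
    return attr, value, "frequency"
```

Each step is a separate pass over the instance descriptions, which is the source of the three serial searches attributed to the procedure.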

Such a procedure assumes that only binary attributes are encoded. Since all n-ary
to the extent of uncertainty about the information to be received. Consider the following
example taken from Garner (1962). If a coin is tossed in the air, then the uncertainty of
that message is obviously determined by the number of possible outcomes, in this case two.
If instead a die is rolled, then the uncertainty would be greater since there are six possible
outcomes. Thus, the roll of the die contains more information potential than the toss of the
coin, since the outcome of the die is more uncertain.

Thus one measure of information would be the uncertainty of the event, measured by the
number of outcomes for that event. This is the measure suggested by Langley et al. (1984)
for ACM, since an attribute is selected as more informative if it accounts for more of the
total uncertainty (i.e., the sum of positive and negative outcomes it correctly discriminates).
However, consider the case when two coins or two dice are tossed in the air. Two coins
have 4 possible outcomes whereas two dice have 36 possible outcomes. Further, three coins
do not provide us with 6 units of information potential (i.e., states of uncertainty) but 8.

Thus a simple linear additive model of accumulated information is clearly inadequate. The
number of possible outcomes of an event does not give a measure of the uncertainty of
that event. The measure which satisfies the above state of affairs must be a logarithmic
model, rather than the linear model employed by ACM. This is so because a logarithmic
function is monotonically related to the number of outcomes, and each successive event adds
the same amount of uncertainty as preceding events.
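The coin and dice argument can be checked numerically. The short sketch below (function names are ours) shows that the number of joint outcomes multiplies with each added event, while its base 2 logarithm grows by the same amount each time.

```python
import math


def joint_outcomes(outcomes_per_event, n_events):
    """Number of joint outcomes of n independent events with equally many outcomes each."""
    return outcomes_per_event ** n_events


def uncertainty(k):
    """Uncertainty of an event with k equally likely outcomes, as a base 2 logarithm."""
    return math.log2(k)
```

For example, one, two and three coins have 2, 4 and 8 joint outcomes, but the logarithmic measure assigns them 1, 2 and 3 units, so each coin contributes the same amount.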

This leads to the definition of the bit, the basic measurement unit of information. The
uncertainty U of an event, and hence its potential for carrying information, can be measured
by $U = c \log k$, where k is the number of outcomes of an event and c is a proportionality
constant. It is accepted practice in information theory to utilise base 2 logarithms and to
define the unit of measurement so that c = 1. Thus $U = \log_2 k$. Intuitively, one bit of
information enables us to decide between two outcomes, whereas one bit of uncertainty
involves doubling the number of categories of an outcome.

The above formula for uncertainty assumes that each value in k has an equal chance of
occurring. Since the probability p(x) of any one event occurring is the reciprocal of the total
number of values in k, then $U = \log_2 \frac{1}{p(x)}$ (Garner, 1962). Thus $U = -\log_2 p(x)$ is the
measure of average uncertainty when all categories of x are equally likely. When there is a discrete
probability distribution for x, the average uncertainty is computed by determining the
uncertainty for each item and obtaining a weighted sum of these uncertainties. This weight
will be just the probability of that category occurring. This transformation thus gives us
Shannon's measure of average information: $U(x) = -\sum_x p(x) \log_2 p(x)$.
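Shannon's measure is short enough to state directly in code. The following sketch assumes the input is a list of category probabilities summing to one; the function name is ours.

```python
import math


def average_uncertainty(probabilities):
    """Shannon's average information, -sum p(x) log2 p(x), in bits."""
    # Categories with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

With equally likely categories this reduces to the base 2 logarithm of the number of categories (1 bit for a fair coin), and any unequal distribution over the same categories gives a smaller value.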

The preceding formula allows us to measure how informative any particular feature of
internal or external state will be in deciding the correct operator to apply. A derivation of
this measure is used in Cirrus to select the next attribute, the details of which will be
discussed in a later section. The point here was to demonstrate the superiority of an
information theoretic measure over the linear probability method employed by Langley et al.,
particularly when combinations of attributes must be considered. With regard to Smith and
Medin's (1981) classification of attribute classes, this measure provides a basis for attribute
selection according to the cue validity class of models, such that each attribute seeks to
maximize the discriminability of the concept. Thus, a feature is selected according to how
well it differentiates the application of an operator (i.e., a concept) from the space of all the
domain operators. We will return to a discussion of cue validity vs. category validity models
of attribute selection in discussing shortcomings of the model.




3.2.  Summary 

In this section, we have argued that we could understand the process of operator
selection, termed search control, if we knew what were the critical features of the internal
state (features related to a domain theory of task decomposition) and the external state (the
state of the problem or task). We claimed that although external features are given by the
data, internal features need to be generated by a domain theory of task performance. This
theory related to the decomposition of the top-level task into a subgoal hierarchy that is
specified by the grammar hypothesis. Ordering of the task was scheduled by an agenda
(agenda hypothesis) and the states so produced were encoded by an attribute-value
representation.

To determine which of these features were most informative, we argued that an information
theoretic measure is the most appropriate to inductively select the critical features from the
universe of internal and external features present at the time of each operator application.
These assumptions, grouped under the grammar principle and regularity principle, allow the
mechanization of determining the functional association of states and operators. In the
following sections, we first discuss the actual implementation of Cirrus as a computer
program and then we illustrate its use in analysing solution path protocols from the domain
of subtraction.

4.  Computational  Details  of  Cirrus 

Cirrus is a multi-stage serial process with the following stages:

•  Data Encoding

•  State Parsing

•  Attribute Encoding

•  Decision Tree Construction

•  Application
The following section describes each of these processing stages in turn. First, some
general comments will be made on the computational basis of Cirrus. Overall, Cirrus is an
example of the class of similarity-based induction formalisms (Mitchell, Keller and
Kedar-Cabelli, 1986). It uses multiple training instances (solution path protocols) and an
information theoretic inductive bias (Shannon's measure of average information; Quinlan, 1983). The
output is expressed as a decision tree. It is not limited to conjunctive generalizations, as are
some popular inductive formalisms, such as version spaces (Mitchell, Utgoff & Banerji, 1983).
Decision trees can handle disjunctive concepts equally well. This merely means that one
concept (operator) will occupy multiple leaves of the tree. With that definition of the class
of inductive methods, we can turn to examining the inputs or parameters that need to be
given to Cirrus. To illustrate some of the computational mechanisms, examples from the
domain of subtraction will be employed.

4.1.  Data  Encoding 

As noted earlier, a problem space consists of states, particularly an initial state, and
operators that transform those states. The succession of states from the initial to the goal
state is termed the solution path for that problem. Cirrus accepts as input the solution path
employed in solving a given problem. The sequence of states is displayed and the
human operator is asked to name the primitive operator corresponding to each state change,
transforming the data to a series of state-operator tuples. Note that the set of standard
operators may need to be supplemented with various buggy operators if the student has
non-standard primitives. Figure 1 illustrates the set of state-operator tuples that result from
encoding a subtraction protocol.
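As a rough illustration of what a series of state-operator tuples might look like as a data structure, consider the sketch below. The representation is hypothetical: the state is left opaque, and we assume each tuple pairs the state before a change with the primitive operator named for that change, which the text does not specify.

```python
from typing import Any, List, NamedTuple


class Step(NamedTuple):
    state: Any       # external state before the change (assumed convention)
    operator: str    # primitive operator named for the change


def encode_protocol(states: List[Any], operator_names: List[str]) -> List[Step]:
    """Pair each state change in a solution path with the operator named for it."""
    # A path of n states contains n - 1 state changes.
    if len(states) != len(operator_names) + 1:
        raise ValueError("expected exactly one operator per state change")
    return [Step(state, op) for state, op in zip(states, operator_names)]
```

Buggy operators need no special treatment in this form; they are simply additional operator names.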


4.2.  Parsing

Parsing converts the sequence of primitive operators into a goal structure according to a
grammar that can successfully cover the input string. Figure 2 illustrates the standard order
grammar that corresponds to the algorithm most commonly taught in schools for performing
subtraction. Figure 3 shows a parse tree resulting from parsing a subtraction problem. It is
clear from this diagram how a grammar can specify the goal hierarchy implicit in an
instructed skill. Note that the adequacy of a grammar (and hence of the domain theory) is
determined by its sufficiency to parse the input string, thus giving an empirical validation of
the task decomposition theory.

INSERT  FIGURES  2  AND  3  ABOUT  HERE 

Cirrus employs a bottom-up parsing algorithm with a variable constraint mechanism. The
need for variable constraint parsing deserves closer attention as it illustrates some of the
features of protocol data. An ordinary context-free parser builds parse trees that obey
several constraints:

1. Constituents of a phrase appear in the order specified by the rule that sanctions
building the phrase.

2. Constituents of a phrase abut each other; you can not skip over pieces of the
string.

3. Constituents of a phrase do not overlap each other; they have to abut.

4. Constituents of a phrase have a category/type that is specified by the grammar
rule that sanctions building the phrase.

These constraints are too confining for analyzing protocols if one wishes to consider the
possibility of non-deterministic order of sub-goal expansion as implied by agenda scheduling.
The Cirrus parser relaxes the first two constraints to allow for such possibilities. That is,
constituents need not be ordered nor abutting, but they are still required to be
non-overlapping and to obey type restrictions. If the grammar hypothesis specifies that
constituents abut or appear in a certain order, these requirements can be represented as
predicates attached to the grammar rules. These predicates act as constraints such that the
user can specify added requirements beyond those required by Cirrus.
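The relaxed constraints can be illustrated with a small check on whether a set of already-built constituents may be grouped under one grammar rule. This is a sketch under our own assumptions, not the Cirrus parser: a constituent is represented as a (category, positions) pair, where positions is the set of indices it covers in the operator string, and the category names used with it are hypothetical.

```python
def can_form_phrase(constituents, rhs_categories, predicates=()):
    """Test the two constraints Cirrus keeps, plus any user-supplied predicates.

    constituents   -- list of (category, positions) pairs
    rhs_categories -- categories on the right-hand side of the grammar rule
    predicates     -- extra constraints attached to the rule (may be empty)
    """
    # Type restriction: the categories must match the rule, in any order.
    if sorted(cat for cat, _ in constituents) != sorted(rhs_categories):
        return False
    # Non-overlap: no position may be covered by two constituents.
    covered = set()
    for _, positions in constituents:
        if covered & positions:
            return False
        covered |= positions
    # Ordering or abutment, if wanted, is expressed as a predicate on the rule.
    return all(pred(constituents) for pred in predicates)


def in_listed_order(constituents):
    """Example predicate: each constituent ends before the next one begins."""
    pairs = zip(constituents, constituents[1:])
    return all(max(a) < min(b) for (_, a), (_, b) in pairs)
```

Without predicates, constituents may appear in any order and with gaps between them; attaching `in_listed_order` to a rule restores the ordering constraint for that rule only.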

As discussed previously, the grammar hypothesis specifies the task decomposition. This
grammar must be extended to an annotated grammar to include specification of which
method, if there are alternatives, is to be applied. For simplicity, the model assumes that
in the case of a non-deterministic grammar rule (i.e., with multiple RHS), the operator itself
encodes the information necessary to determine its applicability.

The output of the parsing stage is a goal-tree structure of state-operator tuples. This tree
is converted to an episodic sequence of tuples by a tree walk procedure that generates the
hypothesized internal states (such as stack or agenda contents, and focus of attention
contents). Thus the tree walk procedure specifies the remainder of the grammar hypothesis:
the internal informational states resulting from the specified task decomposition. In this way,
the internal state is built from the primitive operator sequence specified by the external
solution path protocol.

4.3.  Attribute  Encoding 

Each operator-state tuple is converted to an operator-attribute/set tuple by running the set
of all domain attribute predicates over each state description. Recall that the state
description includes the actual scratch marks made by the subject as well as the
hypothesized internal state given by the parsing mechanism. Additionally, some attributes
have access to the previous history of the problem solver (e.g., to episodic STM traces).
The set of attribute classes that we determined adequate for the domain of subtraction
can be classified as follows:
•  Internal State attributes.

These attributes encode features of the internal state and are thus specific to
the hypothesized model being explored. For example (add/10-on-agenda,
top-of-stack).

•  Problem State Constant attributes.

These encode features of the problem that are constant from state to state.
For example (number-zeros, number-columns, number-blanks).

•  Focus of Attention attributes.

These attributes encode features of the external state with respect to the column
of the subtraction problem that the student is currently attending to. For example
(top<bottom, top<bottom-originally, bottom=blank).

•  Rightmost Unanswered Column attributes.

These attributes encode features of the external state with respect to the
rightmost column whose answer row is still a blank. The actual features encoded
are identical to the focus of attention attributes.

•  History attributes.

These attributes encode previous events in the problem solving history. Most of
these concern the previous operator applied, but some are flags indicating
once-only events. For example (just-borrowed, borrowed-ever).


Of course, new attribute sets need to be constructed for each task domain. In general, a
task analysis is usually sufficient to establish the universe of necessary attributes. Note that
some of the complexity in the above description of the attribute set stems from the use of
a propositional encoding of attribute-value pairs as opposed to a more powerful first-order
predicate encoding. Thus attributes cannot take arguments, such as column-type. This
explains why it is necessary to specify Focus of Attention attributes distinct from Rightmost
Unanswered Column attributes. The output of this stage is a sequential list of
operator-attribute tuples for each problem.
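The attribute encoding stage amounts to running a fixed set of predicates over every state. A minimal sketch follows, assuming states are dictionaries and each predicate takes the state together with the sequence of operators applied so far; the predicate names and state fields in the usage note are illustrative, not the Cirrus attribute set.

```python
def encode_attributes(steps, predicates):
    """Convert (state, operator) pairs into (operator, attribute-values) pairs.

    steps      -- list of (state, operator) pairs in protocol order
    predicates -- dict mapping an attribute name to a function of
                  (state, history), where history is the tuple of operators
                  applied before this step
    """
    encoded = []
    history = []
    for state, operator in steps:
        values = {name: pred(state, tuple(history)) for name, pred in predicates.items()}
        encoded.append((operator, values))
        history.append(operator)
    return encoded
```

For instance, a focus of attention attribute could be written as `lambda state, history: state["top"] < state["bottom"]`, and a history attribute as `lambda state, history: bool(history) and history[-1] == "Regroup"`. Because the encoding is propositional, the same comparison applied to a different column has to be registered as a separate named predicate.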


4.4.  Decision  Tree  Construction 

The final stage recursively constructs a decision tree. At each node, the most informative
attribute, as defined subsequently, is selected. The operators sorted to this node are further
discriminated according to their value on the selected attribute. The process halts if all the
operators sorted to the node are identical (e.g., the unary set) or if there are no attributes
remaining which can discriminate the operators.
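The recursion just described can be sketched as follows. This is an illustrative reading rather than the Cirrus code: examples are (operator, attribute-values) pairs, "most informative" is taken to mean highest information gain, an attribute is assumed able to discriminate when it takes more than one value among the examples at a node, and the noise filter and many-valued attribute correction are omitted.

```python
import math
from collections import Counter


def info(operators):
    """Average information (in bits) needed to identify the operator class."""
    n = len(operators)
    return -sum(c / n * math.log2(c / n) for c in Counter(operators).values())


def partition(examples, attribute):
    """Group examples by their value on one attribute."""
    groups = {}
    for operator, values in examples:
        groups.setdefault(values[attribute], []).append((operator, values))
    return groups


def gain(examples, attribute):
    """Information gained by splitting the examples on one attribute."""
    expected = sum(
        len(group) / len(examples) * info([op for op, _ in group])
        for group in partition(examples, attribute).values()
    )
    return info([op for op, _ in examples]) - expected


def build_tree(examples, attributes):
    """Return an operator name, a list of operators, or (attribute, branches)."""
    operators = [op for op, _ in examples]
    if len(set(operators)) == 1:
        return operators[0]                      # all operators identical
    useful = [a for a in attributes if len(partition(examples, a)) > 1]
    if not useful:
        return sorted(set(operators))            # discrimination incomplete
    best = max(useful, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]
    return (best, {value: build_tree(group, remaining)
                   for value, group in partition(examples, best).items()})
```

A leaf holding a list of several operators corresponds to a node where discrimination is incomplete, which the sketch reports rather than hides.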

4.5.  Attribute Selection

An information theoretic measure is utilized to determine the choice of the root and
subsequent attributes in the Dtree. It is based on Shannon and Weaver's dictum that an
event is informative to the degree that it permits one to decide among a set of alternative
possibilities as to what it might have been. The justification for the following selection criterion
is outlined in detail in Quinlan (1983, 1986) and is only briefly summarized here.

At any particular node, let C be the set of objects sorted to that node. C will contain the
classes of operator P, ..., N with p, ..., n objects per class. Let N be the total number of objects in
C, so that N = p + ... + n. If a decision tree were to classify a random object at this point in
the tree, it would assign the object to class P with the probability $\frac{p}{N}$. Thus, the
information required to classify an object as one of P, ..., N is then:

$$I(p, \ldots, n) = -\sum_{i \in \{p, \ldots, n\}} \frac{i}{N} \log_2 \frac{i}{N}$$

An attribute A with values $\{A_1, \ldots, A_v\}$ will partition C into $\{C_1, \ldots, C_v\}$ leaves,
where $C_i$ contains objects in C that have value $A_i$ on attribute A. This set of objects will
contain $p_i$ items of class P and so on through class N. Thus the expected information required
once A has been tested is a weighted average of the information required in each branch,
where the weight for each branch is the proportion of objects in C that belong
to that branch:

$$E(A) = \sum_{i=1}^{v} \frac{p_i + \cdots + n_i}{N} \, I(p_i, \ldots, n_i)$$

Thus, the gain in information given by attribute A is the actual information needed to
generate a classification minus the expected information:

$$\mathrm{gain}(A) = I(p, \ldots, n) - E(A)$$

So, at each node in the Dtree, this quantity is computed for all the attributes that are not
parents of the current node. The attribute which maximizes the gain of information at this
node in the tree is selected, and the operators that have been sorted to this node are
further discriminated by their values on the attribute so selected.
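To make the three quantities concrete, the sketch below computes them from class counts. The representation is our own: a node is a list of per-class counts, and an attribute is described by one such list per branch.

```python
import math


def information(counts):
    """I(p, ..., n): information needed to classify an object, from class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def expected_information(branches):
    """E(A): branch informations weighted by the share of objects in each branch."""
    total = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / total * information(counts) for counts in branches)


def gain(branches):
    """gain(A) = I(p, ..., n) - E(A), with node counts summed over the branches."""
    node_counts = [sum(column) for column in zip(*branches)]
    return information(node_counts) - expected_information(branches)
```

As a worked case, take a node holding 1, 1 and 2 objects of three operator classes, so that I = 1.5 bits. An attribute that sends the first two objects down one branch and the last two down the other gives `branches = [[1, 1, 0], [0, 0, 2]]`, E(A) = 0.5 and a gain of 1 bit.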

Two other factors must be considered in attribute selection: the influence of noisy data on
attribute selection, and the inflation of information content by multiply-valued attributes
(Quinlan, 1986). The source of noise could be misclassified objects on the basis of incorrect
attribute value assignation or classification. Thus the tree building mechanism must know
when the attribute set is unable to fully distinguish the classes of objects, and also when
not to needlessly complicate the tree to classify incorrectly valued objects.

The chi-squared method ($\chi^2$) suggested by Quinlan (1986) is employed as a noise filter.
If an attribute is useful in classifying an object, then there will be a correlation of the values
of the attribute with the class of objects in C. If an attribute is irrelevant to the classes of
object, then the expected value $p'_i$ of $p_i$ will be:

$$p'_i = p \cdot \frac{p_i + \cdots + n_i}{p + \cdots + n}$$

This value can be utilized as the expected value in a normal chi-squared equation, and the
resulting value checked against a stringent significance level (p < .01). In this way,
attributes whose value distribution is unassociated with class distributions will not be
selected, thus giving some immunity to noise in the data.
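The filter can be sketched from class counts per branch. The statistic below is the ordinary chi-squared sum over branches and classes, using the expected counts defined above. The critical value is left as a parameter because it depends on the degrees of freedom and on the significance level chosen; the names are ours and the sketch makes no claim about how Cirrus stores its counts.

```python
def chi_squared(branches):
    """Chi-squared statistic for an attribute, from per-branch class counts."""
    class_totals = [sum(column) for column in zip(*branches)]
    total = sum(class_totals)
    statistic = 0.0
    for counts in branches:
        branch_total = sum(counts)
        for observed, class_total in zip(counts, class_totals):
            # Expected count if the attribute were irrelevant to the class.
            expected = class_total * branch_total / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


def passes_noise_filter(branches, critical_value):
    """Keep an attribute only if its association with the classes is significant."""
    return chi_squared(branches) > critical_value
```

For a two-valued attribute and two classes there is one degree of freedom, for which the standard critical value at the .01 level is 6.635. An attribute with counts `[[20, 0], [0, 20]]` yields a statistic of 40 and passes; one with counts `[[10, 10], [10, 10]]` yields 0 and is rejected.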

Kononenko, Bratko and Roskar (1984, in Quinlan, 1986) report that the gain criterion
suggested by Quinlan (1984) is sensitive to the number of values in an attribute, favouring
attributes with more values. Rather than limiting feature sets to binary attributes, Cirrus
implements Quinlan's (1986) suggestion for overcoming this selection bias.
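The text does not spell out which correction is meant. One remedy discussed in Quinlan (1986) is the gain ratio, which divides the gain by the information content of the attribute's own value distribution. The sketch below illustrates that remedy on the assumption that it is the one intended; it is not taken from the Cirrus source.

```python
import math


def information(counts):
    """Average information in bits for a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain(branches):
    """Information gain of an attribute, from per-branch class counts."""
    total = sum(sum(counts) for counts in branches)
    node_counts = [sum(column) for column in zip(*branches)]
    expected = sum(sum(counts) / total * information(counts) for counts in branches)
    return information(node_counts) - expected


def gain_ratio(branches):
    """Gain divided by the information of the split itself (assumed correction)."""
    split_information = information([sum(counts) for counts in branches])
    if split_information == 0:
        return 0.0          # a single branch cannot discriminate anything
    return gain(branches) / split_information
```

With class counts 1, 1 and 2 at a node, a four-valued attribute that isolates every object has a gain of 1.5 bits against 1 bit for a two-valued attribute splitting the objects 2 and 2, yet its gain ratio is 0.75 against 1.0, so the many-valued attribute is no longer preferred automatically.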

The output from Cirrus is a graphic tree representation of the decision tree. Each node
of the tree represents an attribute which, at that node, carried the greatest information
content in the overall discrimination of the operators. The arcs of the tree are the values of
the discriminating attribute. The leaves of the tree represent the operator(s); thus the path
from the root of the tree to that leaf represents the rule for applying that particular class of
operators. Note that the most influential attributes (i.e., most informative) for an operator's
application will be high in the tree, whereas the more trivial attributes, along with any
remaining "noise", will be lower in the tree. This concludes our discussion of the
implementation details of Cirrus. In the following section, we apply Cirrus to protocol data
from the domain of subtraction, to illustrate this style of induction-based protocol analysis.
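Reading a rule off each root-to-leaf path is mechanical. The sketch below assumes a tree encoded as nested `(attribute, {value: subtree})` tuples with operator names at the leaves; this encoding and the attribute names in the usage note are ours, chosen only for illustration.

```python
def rules(tree, path=()):
    """List (conditions, leaf) pairs, one for each root-to-leaf path."""
    if not isinstance(tree, tuple):
        return [(path, tree)]                    # a leaf: operator(s) to apply
    attribute, branches = tree
    collected = []
    for value, subtree in branches.items():
        # Each arc adds one attribute-value condition to the rule.
        collected.extend(rules(subtree, path + ((attribute, value),)))
    return collected
```

Applied to `("add10-on-agenda", {False: ("T<B", {False: "Diff", True: "Regroup"}), True: "Add/10"})`, this yields three rules, the shortest being that Add/10 applies whenever add10-on-agenda is true. Conditions appear in each rule in order of their depth in the tree, so the most informative attribute comes first.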

5.  An  Analysis  of  Subtraction  Protocols 

In this section, we apply the Cirrus method of descriptive analysis to subtraction protocol
data. We will utilize as our Grammar hypothesis the standard order grammar depicted in
Figure 2.

5.1.  Procedure 

The data analysed were 12 subtraction problems solved by P.D., a 3rd grade student. This
student solved these items in a paper and pencil test. In order to collect the exact writing
actions, the test page was taped to an electronic tablet and PD filled out the test with a
special pen. Tablet data was then converted to a sequence of character writing actions
separated by measured pauses (VanLehn, 1982, 1985; VanLehn and Ball, 1987). Each
scratch mark made on the page defined an external state transition, hence each problem
could be encoded as a sequence of state transitions. The problem set is illustrated in
Figure 4. The set of problems was designed so that Borrow and Borrow from Zero
procedures were tested as well. Figure 5 illustrates the sequences of external states that
PD wrote in the solution of one problem.


INSERT  FIGURES  4  AND  5  ABOUT  HERE 


These sequences of solution path protocols were encoded as cartesian-coordinate problem
representations. The first stage of Cirrus sequentially displayed the external state changes of
the page and queried the human operator for which of a predefined set of primitive operators could
produce such a state change. The set of operators unambiguously corresponded to these
external state changes. The list of normal and buggy operators that were found sufficient to
encode a number of different subjects' subtraction protocols is given in Table 1, along with
the action implied by the operator's name. Each scratch mark made on the page defined
an external state transition, hence each problem could be encoded as a sequence of state
transitions.


The standard order grammar illustrated earlier was used to parse these external
state/primitive operator pairs. The internalized state descriptions produced by the grammar
hypothesis, along with the attribute set described previously, complete the input parameters to
Cirrus. From this input, Cirrus produced Dtrees that collapse across the 12 examples fed to
the program.
5.2.  Results

Figure 6 diagrams the results of Cirrus's analysis of the data². This tree structure requires
some explanation before the actual results can be discussed. The nodes of the tree
propose an attribute whose values are maximally informative in discriminating operators in the
subtree below the node. Attributes that are higher in the tree are more informative than
those lower in the tree. Values of the attribute (often just true or false) label the links
between nodes. The tree can be seen as "sorting" operators to its leaves. The operator
appears in a box under the leaf. When two or more operators appear together in a box,
that means either no state features were capable of discriminating the operators or
senseless attributes were being used in a "last ditch" effort to separate operators that
reasonably should not be discriminated. The one case where this occurred is discussed
below.

This tree needs to be considered in conjunction with the goal tree (grammar hypothesis)
proposed for subtraction (Figure 2). First consider the leftmost branch of the dtree. The
operators Sub, Sub1Column, ShowTop and Decrement can be correctly "scheduled" by
reference only to the state of the agenda. Note that primitives pop themselves from the
agenda when completed, as do goals when their subgoals have all been completed. Also,
ShowTop is assumed to have been placed on the agenda rather than Diff due to method
information encoded in the operator as to its applicability. So if we know that we want to
do a subtraction problem and we have not done anything yet (i.e., the agenda is empty),
then we apply the goal operator Subtract. Likewise, if the subgoal to take the difference has
not yet appeared on the agenda, then we want to subtract one column, and so on. What
is significant here is that there is sufficient information just in what has appeared on the
agenda to schedule these operators without reference to any external state or subgoal
ordering.


INSERT  FIGURE  6  ABOUT  HERE


The remainder of the tree divides into two main subtrees. Note that if the goal to take
the difference of two numbers (Diff) has been put on the agenda, then we are either going
to be able to just take the difference, or we will need to borrow (called Regroup here). This
is expressed by the middle branch of the tree, and of course, we decide between these two
alternatives according to whether the top digit is less than the bottom digit in the right-most
unanswered column (RUC.T<B). Note that this is the only feature of external state chosen
by the tree. If Regroup has already been chosen and expanded, this means that the
operator Add/10 must be on the agenda (Add/10 ON AGENDA). This attribute discriminates at
a higher level of the tree whether to process a column (Diff or Regroup) or complete the
borrowing process.
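
Read as code, this part of the tree might look like the following sketch. It is reconstructed from the prose alone, not from the figure; the function name choose_branch, the use of a set of operator names for the agenda, and the label "complete-borrow" are assumptions made for illustration.

```python
def choose_branch(agenda, top, bottom):
    """Pick a branch for the right-most unanswered column (RUC).

    agenda      -- set of operator names currently on the agenda
    top, bottom -- digits of the RUC, the only external state consulted
    """
    if "Add/10" in agenda:
        # Regroup was already chosen and expanded: finish the borrow first.
        return "complete-borrow"
    if "Diff" in agenda:
        # The test T<B decides between borrowing and taking the difference.
        return "Regroup" if top < bottom else "Diff"
    # Otherwise the agenda-only part of the tree applies (not modelled here).
    return None


print(choose_branch({"Diff"}, top=3, bottom=7))
print(choose_branch({"Diff"}, top=7, bottom=3))
```

The agenda test comes first in the sketch because the text places that attribute at a higher level of the tree than the comparison of the two digits.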

The rightmost branch of the tree deals with scheduling the operators to complete the
borrow procedure. Interestingly, the ordering of these operations (Decrement, Add/10,
ScratchMark, From and Regroup) is probably the least determined by the standard order
algorithm as taught by schools. This is reflected in the Dtree attribute nodes, which refer to
a variety of internal states, including memory states. For PD, ScratchMark is the first
operator executed whenever it appears on the agenda. This is sensible, since this mark is a
temporary reminder to decrement. If we have just made a scratch mark (Just/ScratchMark), then
we are going to Add/10 or Decr, the remaining two operators left in the borrow sequence.
The actual output Dtree could differentiate these operators according to psychologically
implausible attributes (number of top zeros and whether the top digit was originally zero).


Subjects themselves show greater variability in scheduling these two operators, both between
subjects and even within subjects in the same problem. Since the order of applying these
does not affect the outcome of the procedure (i.e., its correctness), perhaps the Dtree is just
reflecting a basic indeterminacy in this scheduling and selecting only spuriously associated
state features. (Recall that the attributes lower in the tree are less important and
predictive.) For this reason, we show the operators together.

Finally, the remainder of the tree uses a memory trace to schedule the remaining Add/10
operators directly after a decrement operation. These Add/10 operators are different from
the previous group of Add/10 operators in that these do not occur in the context of
borrowing. Alternatively, if one has not yet performed the basic operators for borrow (Decr,
Add/10, ScratchMark), then one executes the planning operators for borrowing: From and
Regroup. These operators are just ordered by their appearance on the agenda, Regroup
being performed first (which indeed it must be before From can be executed).

Thus the analysis of this Dtree suggests that the Cirrus method of descriptive analysis tells
us the information used by this student in performing the complete range of subtraction
skills. It has certainly presented a coherent theory of the skills required to control the
execution of a subtraction procedure (search control). Subtraction for multi-column problems
is viewed as a procedure rather than a skill requiring search. This is reflected in the Dtree
in that only one external state feature is needed to effectively sequence a complex collection
of skills. The top-most ordering for this skill derives from the task decomposition given in
the grammar. This analysis also predicts that the only intermediate results that PD needs to
store in working memory (Just/ScratchMark and Just/Decr) are involved in sequencing the more
complex Borrow operation. One could speculate that the reason students have more trouble
in acquiring borrowing skills is due to this extra memory demand. Interestingly, the results
demonstrate that relatively unstructured ordering principles such as agendas are nevertheless
sufficient to schedule complex operator sequences with very little additional information.
In the next section, these results are related to a wider view of what can be
accomplished with the use of such an automated method of analysis.

6.  Discussion 

We take the previous example as evidence for the utility of a descriptive theory of
cognition. However, how might such results be utilized? Cirrus is seen as having three
major applications. These three applications, data reduction/analysis, student modelling and
architectural hypothesis testing, are described with regard to the problem space hypothesis,
using the data just analysed.

Recall that for the problem space hypothesis, problem solving and other phenomena are
viewed as search within the domain of the problem space. One way we could describe the
process of operator selection (search control) is to know which features (attributes) of the
internal and external states are most predictive (and hence, presumably causal) of operator
application. By knowing these features, we would know what external cues and internal
states of the information processor "drove" the sequence of operators observed. Previous
approaches to protocol analysis (Bhaskar and Simon, 1977; Ericsson and Simon, 1984)
offered no mechanistic way that such an analysis could be conducted. The model Cirrus
described in this paper offers an automated method for exploring these kinds of data
analysis. In the next section, we propose three difficulties in the traditional approach to
protocol analysis, and describe how Cirrus might answer these difficulties.

6.1.  Limitations  of  Protocol  Analysis  of  Solution  Paths 
As a method of enquiry, protocol analysis of solution path data has several unresolved
issues that stem from the above-noted need to describe the internal states of the processor
and the external cues that predict operator application.

•  To date most empirical studies of the problem space hypothesis have concerned
the protocol analysis of single subjects. No adequate technology as yet exists for
comparing and contrasting the solution paths of multiple data sets. Nor has there
been a means of reducing the data of solution paths so that only the essential
features are considered. There is not the means for conducting hypothesis tests
of group differences, a standard resource for data comparison. This limits
protocol-based studies to descriptive rather than evaluative formats. We term this
the need for data reduction/analysis.

•  The output of protocol analysis is potentially useful in many facets of student
modelling, from assessment of learned procedures to intelligent tutoring system
uses. However, it is at present not feasible to utilize such data in any "real
time" sense, since protocol analysis involves tedious hand-coding of large amounts
of data, and the promulgation of many "intuitive" but untested assumptions.
There is a need to automate this analysis process so that student models can
be built in real time, with a clearly defined set of assumptions. We term this
the need for automated student modelling.

•  The problem space hypothesis leaves under-determined the ordering principle,
that is, the control processes that schedule the application of operators to achieve
the goal state. There are many applicable models in the artificial intelligence
literature, from planning models to blind search models, but there has been no easy
way to test the architectural assumptions with respect to their fit to the data.
Anderson (1983) has argued that such an issue may in fact be undecidable,
since an infinitude of machines can model arbitrary input/output relationships.
Nevertheless, such models of internal architecture have been proposed in the
literature. We feel that there is a need to evaluate such descriptive models
against real solution path data. We term this the need for an evaluation of
ordering principle adequacy.


These three issues, of data reduction/analysis, ordering principle testing and automated
student modelling, are elaborated next in greater detail, along with an indication as to how
Cirrus may solve these problems.


6.2.  A  Basis  for  Data  Reduction. 

There are two criterial ways in which solution path data may be utilized. Firstly, we wish
to know which of the universe of features of a problem state are the relevant ones, that is:
what aspects of the data are causal or correlated with the outcome? Secondly, how can
different sets of data, once the critical aspects of the data have been found, be compared?


The first problem is one of induction over the universe of attributes. That is, which
attributes are most highly predictive of an operator application? This information is directly
available as the result of Cirrus's analysis. Referring back to Figure 6, we can see what is
important in learning how to execute a subtraction procedure. For the operators that
establish the topmost goals (Sub, Sub1Col) in subtraction, only their order of appearance on
the agenda is required. Similarly, if one does not take the difference in a column (i.e., Diff
not on agenda), then one will just write down the top digit (ShowTop). From this, it is
tempting to hypothesize that learners learn just the sequence of activities for these
operators.

Obviously the most crucial piece of information is whether top < bottom, in order to
decide whether to take the difference or to initiate borrowing. Thus the borrowing procedure
is executed when we have two conflicting pieces of evidence: if Diff is on the agenda, but
top < bottom, then we must delay executing Diff and do Regroup first. Thus for this
operator, learned sequences are contingent on external states. Likewise, ordering of
operators may be contingent on the history of previous actions, as it is for Add/10 or
decrementing, which occur after making a scratch mark on the page.

Cirrus, then, is capable of separating from state representations what is important
knowledge in skilled performance. For further analysis, one may wish to assess the extent
of differences between subjects' solution paths as a function of different treatment or
sampling conditions. Previous protocol approaches that have tried to compare across groups
have assumed a metric scaling space and rated the protocols on various dimensions within
this space, assessing similarity or difference by traditional statistical methods. However, this
approach is unsatisfactory for several reasons. Firstly, the measures that are inferred from
the data are those that fit the subjective view of the experimenter as to what is interesting
to examine. Secondly, Tversky (1977) has questioned the validity of the geometric approach,
in that the assumption of a metric scaling space is often untenable where judgements of
similarity are concerned.
Tversky suggests instead a contrast model of analysis, employing a linear combination of
common and different feature sets to assess similarity. Such a matching function can
measure the degree to which two objects, viewed as sets of features, match each other.
Such an analysis needs a comparison of feature set commonalities and differences rather
than a computation of metric distances between points of inferred constructs. However,
before such a matching function can apply, there is a prior necessary process of the
extraction and compilation of the relevant set of features maximally associated with an
operator application. The model Cirrus performs the preliminaries to such an analysis, by
extracting the critical features from the universe of attributes.
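
A minimal sketch of such a matching function is given below, using the usual statement of Tversky's contrast model, S(A, B) = theta*f(A and B) - alpha*f(A minus B) - beta*f(B minus A). The weights, the choice of set size as the salience function f, and the two example feature sets are assumptions made for illustration; they are not results from the protocols analysed here.

```python
def contrast_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5, salience=len):
    """Tversky-style contrast between two feature sets.

    Common features raise the score; features found in only one of the
    two sets lower it. `salience` measures a feature set (here, its size).
    """
    a, b = set(a), set(b)
    return (theta * salience(a & b)
            - alpha * salience(a - b)
            - beta * salience(b - a))


# Hypothetical critical features extracted for the same operator in two subjects.
subject_1 = {"Diff on agenda", "T<B"}
subject_2 = {"Diff on agenda", "Just/ScratchMark"}

print(contrast_similarity(subject_1, subject_1))
print(contrast_similarity(subject_1, subject_2))
```

Because the function operates on sets of features rather than on coordinates, it requires no assumption of a metric scaling space.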

In summary, we perceived a need to discern the critical features of external and internal
state that maximally predict an operator's application. Such an analysis can firstly be
employed for data reduction, so that complex protocol data can be understood by noting
only the critical features. Further, this type of data reduction could be employed as
input to an analysis of feature similarities and differences, which would serve as a basis for
within- or between-subject comparison. Cirrus exactly meets these requirements.

6.3.  Automated  Student  Modelling 

There is a further extension of the model beyond a theoretical basis for data reduction. If
the assumptions of the model (the grammar hypothesis and the ordering hypothesis) are
appropriate for human problem-solvers, then Cirrus can serve as a model for the process of
operator selection, that is, a student model. This is a stronger theoretical claim than the
previous one of determining the state features that "cause" an operator application. Here
we are saying that the process utilized by Cirrus to discriminate when a particular operator
should be applied may be analogous to the process a student utilizes when solving a
problem. That is, Cirrus in this role is describing the actual search control mechanism
employed by a problem solver. Thus, the output of Cirrus (decision trees) would be
hypothesized to be control knowledge employed by the problem solver.


Domains such as subtraction, algebra and physics fall readily into the Problem Space
hypothesis approach, where the basic manipulations constitute the operators within the
problem space. The task for the student is to apply the correct sequence of operators to
move from the initial to the goal state. Such learning may be termed search control. How
this knowledge is induced from examples, and the form of this knowledge, are interesting
questions. We imagine that the form of this knowledge could be described as falling along
what may loosely be described as an external-internal state descriptor dimension. At one
pole is the kind of search control knowledge described by Lewis and Anderson (1985) as
schemata abstracted from problem examples that predict when a particular operator will
and will not work. These schemata contain certain problem features that are properties of
the problem diagram and information contained in the problem description. They conclude
that search control can be described as a correlation between surface features of the
problem state (external state) and the correct rules of inference for these problems.


At the other pole are models which propose that what is learnt as search control is an
internal structure which guides the sequence of operator application. As an example,
VanLehn (1985) argues that problem solvers learn a goal hierarchy which specifies the order
of operator application. Thus, there are internal features of the processor state (internal
state) that, when learnt, can guide the execution of a complex skill. Inevitably, search control
will consist of a mixture of external state features and internal state features. In terms of
modelling the search control acquired by a learner, it would be instructive to know what
these critical features are.


Cirrus exactly describes what the balance is between search-type attributes and
procedural attributes. Given solution path protocols, Cirrus can now determine the features
of the internal and external state that the student has most likely employed in reaching a
solution. This data can then be employed in student modelling applications, be it evaluation
of the search control strategies employed by the student for research purposes, diagnosis of
incorrect operator application caused by attention to incorrect features of the problem state
in teaching situations, determining differential procedures following differing teaching histories,
or building representations of the student in ICAI applications.

6.4.  Evaluation  of  Decomposition  and  Ordering  Principles 

The two most significant free parameters of Cirrus are the ordering hypothesis and the
task decomposition hypothesis. Both of these can be varied and the resultant models
evaluated on a "what if?" basis. We have argued for an agenda ordering structure on the
grounds of its generality. However, we could postulate more constrained ordering principles.
For example, we may postulate that a subject uses a stack architecture. That is, the plan
or procedure that the subject follows is represented following a stack regime, such that
when all the sub-goals of a goal are accomplished, then that goal can be popped off the
stack. Note that the stack model is a more restricted case than the agenda regime used
here. The primary feature of stack regimes is deterministic ordering of subgoal expansion,
against non-deterministic expansion order in an agenda.
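
The contrast between the two regimes can be made concrete with a small sketch. The goal tree, the function names stack_order and agenda_orders, and the simplification that an expanded goal is simply replaced by its subgoals are assumptions made for illustration; the sketch is not the ordering machinery of Cirrus itself.

```python
def stack_order(tree, goal):
    """Primitive order under a stack regime: depth-first, subgoals in listed order."""
    subgoals = tree.get(goal, [])
    if not subgoals:
        return [goal]
    order = []
    for sub in subgoals:
        order.extend(stack_order(tree, sub))
    return order


def agenda_orders(tree, goal):
    """Every primitive order an unconstrained agenda permits for the same tree."""
    results = set()

    def explore(pending, trace):
        if not pending:
            results.add(trace)
            return
        for i, item in enumerate(pending):      # any pending item may be chosen next
            rest = pending[:i] + pending[i + 1:]
            subgoals = tree.get(item, [])
            if subgoals:
                explore(rest + tuple(subgoals), trace)   # expand a goal
            else:
                explore(rest, trace + (item,))           # execute a primitive

    explore((goal,), ())
    return results


# Hypothetical decomposition of the borrow procedure.
tree = {"Regroup": ["From", "Add/10"], "From": ["ScratchMark", "Decr"]}

print(stack_order(tree, "Regroup"))
print(len(agenda_orders(tree, "Regroup")))
```

For this tree the stack regime yields a single sequence, while the agenda admits every ordering of the three primitives, so any regularity in a subject's ordering has to be carried by additional attributes.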

The choice of ordering principle suggests the set of internal attributes to supplement the
already established set of external attributes. For example, in the case of a stack
architecture, relevant attributes may be top of stack and depth of stack. Finally, the set of
operator states (primitives plus sub-goal operators determined by the hypothesized
architecture) resulting from the parsed problem, together with the concomitant encoding of
attribute sets, are utilized to construct a decision tree. The decision tree represents the
control structure that directs the execution of the procedure on the hypothesized architecture.
Different architectural models and the resultant control structures may then be compared
with each other. We have no formal way of contrasting the resultant decision trees; instead
we rely on the following heuristics to evaluate "goodness of fit". The ordering principle that
utilizes a minimal set of features to produce decision trees of maximal simplicity and
psychological plausibility is selected on the grounds of parsimony. Further bases for
comparison arise from the fact that building such an "executable" model of search control
demonstrates the limitations inherent in the proposed architecture. For example, the above
illustrated stack model architecture, with its order-determined sequence of subgoal expansion,
may be too inflexible to account for actual solution path data in a plausible manner.

The  ability  to  manipulate  such  parameters  and  estimate  their  effect  on  the  knowledge 
structures  depicted  by  the  resultant  Dtrees  seems  advantageous  in  the  construction  of 
theoretical  models. 

6.5.  Limitations  and  Problems 

The heart of the Cirrus model is the inductive method employed in selecting attributes. We
have noted that the information theoretic model employed falls into the cue validity class of
induction models. That is, attributes are selected on their ability to discriminate between
operator classes. (Here, the basis of discrimination is information content.) However, we
have no guarantee that this is the strategy employed by human subjects. Indeed, there is
evidence that in the domain of concept formation, humans utilize category validity methods.
Here, attributes are selected on their correlation with the set of other attributes defining a
concept (Medin, Wattenmaker and Michalski, in press; Rosch and Mervis, 1975). The
difference between the knowledge structures arising from cue versus category methods has
yet to be explored, and an extension of the Cirrus model would be to incorporate an
alternative category-validity attribute selection method.
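
As an illustration of selecting attributes by information content, the sketch below computes an ID3-style information gain for each attribute over a set of operator applications. The exact measure used by Cirrus is not restated in this section, so the formula, the function names, the attribute "top is even" and the four example rows are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of operator labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, label="operator"):
    """Reduction in operator entropy obtained by splitting rows on `attribute`."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[label] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[label] for row in rows]) - remainder


# Hypothetical encoded states: one row per operator application.
rows = [
    {"T<B": True, "top is even": True, "operator": "Regroup"},
    {"T<B": True, "top is even": False, "operator": "Regroup"},
    {"T<B": False, "top is even": True, "operator": "Diff"},
    {"T<B": False, "top is even": False, "operator": "Diff"},
]

attributes = ["T<B", "top is even"]
best = max(attributes, key=lambda a: information_gain(rows, a))
print(best)
```

In these rows T<B separates the two operators completely while the parity of the top digit carries no information, so a cue-validity method of this kind would place T<B at the node.
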
The model uses an attribute-value encoding, a relatively weak propositional formalism for
knowledge representation. More compact and powerful state descriptions may result from
adopting a first order predicate state representation, particularly in expressing relational
information. The extent of the limitation from a propositional representation is not yet clear.


6.6.  Conclusions 

In the introduction, we argued that theories of cognition that drew their sources of
theoretical constraints from functional analyses of task domains should limit their hypotheses
to functionally based statements about competence. In the analysis presented here, we have
taken a theory of how subjects decompose the task of subtraction (the decomposition
hypothesis) and generated a representation of a problem solver's internal and external state
constrained by this theory. From this, we have been able to extract the knowledge
structures utilized by the problem solver that adequately explain the competence
demonstrated.


In doing so, we have not needed to hypothesize an architecture that would serve to
execute this knowledge. We believe that this is an important advantage of the current
method. Not only does it obviate the need for ad hoc test and generate cycles to
determine the "rule structure" of a particular task domain (as would a conventional
production system analysis), but it does not build structural mechanisms based only on
functional constraints. Thus there is no temptation to reify descriptive mechanisms to the
status of an actual structural mechanism. Neither have we elevated the regularities found in
a subject's performance to the status of a rule, that is, something that is explicitly
represented in the neural substrate. Instead we have described observed competence and
can leave theories concerning how such competence may be represented and processed to
theories that concern the structural requirements for computation.
Finally, we have sketched how such a method of analysis may be applied to interesting
problems. We considered the problems of automated analysis of protocols, particularly online
protocols of, say, a student in a tutoring situation. We demonstrated how Cirrus could
meaningfully extract a model of the student's competence. The applications of this to
intelligent tutors that need to build models of that same competence are apparent. We
briefly considered how the induction method of Cirrus might be used to perform group
analyses of protocol data, and suggested how different domain theories might be compared